Abstract


Reinforcement Learning Methods (RLMs) typically select candidate solutions
stochastically based on a credibility space of hypotheses which the RLM
maintains, either implicitly or explicitly. RLMs typically have both
inductive and deductive aspects: they inductively improve their credibility
space on a stage-by-stage basis; they deductively select an appropriate
response to incoming stimuli using their credibility space. In this sense,
RLMs share some learning attributes with active, incremental concept
learners. Unlike some concept learners, which employ deterministic
procedures for selecting hypotheses, RLMs are often given uncertain
evaluations of hypotheses, either because of noisy environments or because
summary evaluations occur only after a sequence of learner-environment
interactions. This paper examines issues of inductive learning bias in this
context experimentally. Specifically, the paper addresses inductive learning
biases in the context of a simple RLM called a Collective Learning Automaton
(CLA). The CLA learns the shortest path through a small network. The
research points out some of the difficulties of finding performance measures
that indicate the strongest, correct biases for the automaton.


1  INTRODUCTION 

Over the past few years, there has been a growing interest in the effects of
bias in learning algorithms. In inductive concept learning, Mitchell
considers bias to be the expressed preference of the learner for considering
one hypothesis of a classification rule (or a generalization rule) over
another (Mitchell, 1980). Mitchell points out that “an unbiased learning
system’s ability to classify new instances is no better than if it simply
stored all the training instances and performed a table lookup when asked to
classify a subsequent instance” (Mitchell, 1980, p. 1). Bias, as defined in
this sense, is necessary for any inductive learning algorithm.

Within concept learning algorithms there are two fundamental types of bias:
language bias and procedural bias^1 (Utgoff, 1986; Rendell, 1986; Gordon,
1990).^2 Language biases are preferences which determine the expression of
hypotheses of the target concept. Procedural biases are preferences which
affect the traversing of the search space. Procedural biases also include
halting rules. Both language biases and procedural biases can affect a
learning algorithm’s speed for finding a classification rule that is close to
the target concept. The language bias reduces the size of the search space by
constraining the number of possible formulations of the hypothesis; the
procedural bias reduces the amount of traversal through the search space by
constraining the search method.

Closely associated with the notion of inductive bias are the notions of
strength and correctness (Utgoff, 1986). To date, the definition of strength
refers more to language bias than procedural bias. The strength of a language
bias corresponds to the size of the entire hypothesis space that the learner
can generate given no constraining procedural bias (e.g., exhaustive search
with labeled instances). The strength of the language bias can be increased
either by reducing the size of the representational space or by transforming
the grammar to one that is less expressive. Not all hypotheses that may be
generated from a language description may be acceptable hypotheses of the
target concept, however. Language biases that permit the formation of
hypotheses that do describe the target concept are called correct biases.^3

^1 Procedural bias is also called algorithmic bias.

^2 The biases within the problem space may be different from those within
the algorithm. For example, instances examined by the learner may be
expressed in a language which is much richer than the learner’s language.

In a general sense, procedural bias constrains the breadth, the direction,
and the duration of the search for the classification rule. A strong
procedural bias minimizes this trajectory through the search space. An
incorrect procedural bias over-constrains or under-constrains this search so
that the learner cannot find a correct classification rule. A powerful
concept learner is one that uses as strong a bias as possible, for both its
language and procedural biases, without sacrificing correctness.

Since many learning algorithms have an inductive component, the analysis of
learning in terms of language and procedural bias applies broadly to many
algorithms other than traditional concept learners. For example, a large
number of active incremental learners, which are generally not considered to
be traditional concept learners, employ what are referred to as Reinforcement
Learning Methods (RLMs) (Whitehead and Ballard, 1990). RLMs have an inductive
component in that they constantly revise the credibilities of their
hypotheses based on experience. These learners then deduce a response from an
input stimulus based on their current hypothesis or set of hypotheses. RLMs
are often used to learn first-order decision sequences in dynamical systems.
Such algorithms include genetic algorithm-based systems (Holland, 1975; De
Jong, 1975) such as Grefenstette’s SAMUEL (Grefenstette, 1988), Artificial
Neural Networks (ANNs) (Rumelhart, 1986) such as those consisting of Widrow’s
ADALINE units (Widrow, 1985), and Q-Learners (Watkins, 1989). This paper, in
particular, examines a simple example of a RLM called a Collective Learning
Automaton (CLA) (Bock, 1992).

Viewing RLMs as having both inductive and deductive components is supported
by other researchers, including those studying learning automata (Narendra
and Thathachar, 1989) and those working in information theoretics as applied
to inductive and deductive inference (e.g., see Watanabe, 1960). Furthermore,
these researchers believe that both forms of inference are necessary for
learning: “Inductive and deductive inference do not contradict but merely
complement each other and both are found to be essential for learning
processes” (Narendra and Thathachar, 1989, p. 15). Watanabe states:
“Inductive inference contains, as a necessary ingredient, a constant
comparison of the deductive consequence from a hypothesis with the
experiment. Accordingly, the model theory of inductive inference must permit
deductive inference to play a corresponding role within its framework”
(Watanabe, 1960, p. 208).


^3 In the PAC learning framework (Valiant, 1984), hypotheses describe the
target concept within limits of accuracy specified as part of the procedural
bias.


A RLM is similar to an active incremental concept learner that is inferring a
single concept. The purpose of the concept learner is to find a final
hypothesis sufficiently close to the target hypothesis that covers the
positive instances the learner has observed. The purpose of a typical RLM is
to seek one solution (or a few) to a decision sequence problem. In the latter
case, the final hypothesis covers only a small fraction of the possible
sequences. If the space being covered is very small and the instances being
received only have the label of “positive” or “negative,” then any inference
procedure, deductive or inductive, would probably do no better than random
search. Instead, RLMs usually receive evaluations which measure, in a sense,
the “degree of positiveness” (or better, the “degree of credibility”) of an
entire solution (or trial hypothesis), which can be thought of as a group of
instances. A simple analogous problem for a concept learner would be to
generate the set of positive instances covered by its current hypothesis for
evaluation. A summary score might indicate the actual fraction of positive
instances in the set. A difference between the concept learner and the RLM in
this case is that the RLM examines a sequential relationship among the
decisions it generates, whereas the concept learner applies a “same member
of” type of relationship.

In the RLM, each individual decision is not directly evaluated as “good” or
“bad.” The problem for the RLM then becomes one of inferring what makes up
the constituent parts of the best solution (decisions in the final
hypothesis) from summary evaluations of trial hypotheses, rather than trying
to infer a final hypothesis through the direct evaluation of constituent
members which are labelled as “positive” or “negative” instances. This
inference problem of finding out the constituent parts of the best solution
is commonly referred to as the credit assignment problem.


There are basically two approaches for studying biases in learning systems.
One approach is to maintain a set of biases while the learner performs a
task, and then, before attempting the task again, search for a better set
based on the learner’s overall performance. The other approach is to
dynamically adjust a set of biases during the learning task based on
intermediate evaluations of performance. This paper calls the first approach
inherited bias (or fixed bias). The second approach is referred to in the
literature as dynamic bias adjustment (or shift of bias) (Gordon, 1990;
Rendell, 1987; Schlimmer, 1987; Utgoff, 1982, 1986). Both of these approaches
have biological analogs. For example, the range of frequencies that a
creature can hear is usually a genetically determined trait that does not
improve due to learning over the creature’s life-span. Not all
characteristics fit this category, however. A creature having a fairly long
life-span must be able to dynamically adjust its initial biases in order to
adapt itself to changing circumstances. For example, biases appropriate to
one stage of development may be inappropriate at another. So we can think of
the dynamic shifting of bias as analogous to bias changes that occur during
various stages of individual development.

Since the primary objective of this research is to examine the effects of
learning biases, and their interactions, on the performance of a RLM, this
study focuses on examining the effects of inherited biases. Comparing the
effect of making one biasing assumption over another is possible when the
combination of selected biases remains constant over a learning task. Using
dynamic bias adjustment involves additional issues. One major issue that
needs to be addressed is how to modify the state of the learner so that it
can continue to learn with a new set of biases. This issue is complicated by
the fact that most RLMs do not maintain an episodic memory of their
experiences, whereas many concept learners save instance information so that
biases can be adjusted retroactively, if necessary, using backtracking. There
are so many ways of implementing dynamic bias adjustment that discovering the
best method for a RLM is a study unto itself.

There are several questions of interest in studying inherited biases in RLMs,
assuming that a learning task represents the life-span of the RLM. For
example, what are characteristic language and procedural biases in the RLM?
If we decompose the problem into two subsystems, one subsystem for modeling
the learner and the other subsystem for selecting the inherited biases, how
should the two subsystems be designed? What performance measure(s) should be
sent from the learning subsystem to the bias search subsystem, which assigns
the learner its inherited biases? Are there any interactions between the
procedural and representational biases? This work presents some preliminary
results in studying combinations of biases in a simple RLM without examining
the bias search subsystem.

The next section provides background information, introducing terminology
that frames the question of inductive biases in the context of RLMs. This
section also discusses a way of viewing the strength of a bias in empirical
terms as information compression. Section 3 gives a quick overview of the
CLA; Section 4 describes the particular shortest path problem being examined,
including an outline of the experimental design; Section 5 presents the
experimental results; and Section 6 gives a brief conclusion.


2  INDUCTIVE  BIASES 


2.1  MEASURING  BIAS  STRENGTH 

The definition of strength of a language bias given at the beginning of the
introduction permits us to perform some analytic evaluation of the limiting
performance of a concept learner. However, if we automate the search for the
strongest, correct bias, we would like to find some performance measures
which would empirically allow us to evaluate biasing assumptions. This is
especially true if we consider the problem of choosing a good procedural
bias.

In RLMs, it is sometimes useful to examine a component of language bias
called representational bias (Gordon, 1990; Rendell, 1987; Schlimmer, 1987;
Utgoff, 1982, 1986; Mitchell, 1980). Representational bias defines the choice
of atoms or primitives in the hypothesis language. Very often, the grammar
for expressing hypotheses is defined in the paradigm of the RLM. For example,
a RLM may always generate a decision sequence, or trial hypothesis, having a
specified format. A trial hypothesis is generated from the RLM’s memory. This
memory could be expressed as a linear expression, a set of productions, or
perhaps a state transition table. The choice of terms used in the RLM’s
memory represents a representational bias within the learner. If we examine a
problem in which the best representational bias is known, then we can
investigate which performance measures can be used to measure a strong,
correct representational bias.

A major procedural bias within RLMs is the amount of change made to the
memory at each stage of learning. Several authors (e.g., Sutton, 1989) call
this factor the learning rate.^4 The best setting of this factor cannot
usually be determined from the problem definition. However, by examining
various levels of this factor in conjunction with different representations,
we can test the effect of this bias on selected performance measures.

2.2  INDUCTIVE  COMPRESSION 

Generally speaking, inductive learners compress information. Watanabe points
out that there are several steps involved in the information compression
(Watanabe, 1971). Using different terms than Watanabe’s, we can consider the
choice of representation of the language as the first compression step, the
grammar of the language as the next compression step, and finally, the
inductive compression specified by the procedure as the third compression
step.

Representation  ->  Grammar  ->  Inductive  Procedure 

Watanabe also points out that there are several ways to arrive at the same
final level of compression. For example, all the burden can be placed on
initially compressing information into a single concept class, in which case
a grammar is not needed. In this case, all of the effort is applied to
finding the best representation for the problem. Alternatively, very little
compression can be used in developing primitives, in which case more burden
is placed on developing a grammar and applying an inductive inference
process. In this second case, the resulting classification rule may be more
complex (e.g., more terms) than the case where more effort is expended in
finding a good representation. Watanabe speculates that the most desirable
situation is to have a “well-balanced distribution of information compression
over different steps” (Watanabe, 1971, p. 567).

^4 The term learning rate is not used in this paper because many factors
actually affect the rate of learning.

2.2.1  Language  Bias

Notice that the first two steps correspond to defining a language bias. When
we select a language bias, we are in effect compressing information. Unless
the language bias is chosen either arbitrarily or from the experience of the
designer of the learning algorithm, a number of steps are required to
compress raw data into a language specification. We can define the strength
of a language bias in terms of the amount of information compression that
occurs over these steps. However, most analyses of algorithms do not consider
the cost of this phase; that is, an initial language bias is often assumed
analytically to be a “given.” Thus, assuming that we can perform compression
instantaneously, we can define the strength of a language bias as follows: If
a language bias, A, performs more information compression over the same set
of data than a language bias, B, then A is a stronger language bias than B.

This definition is an operational extension of the original definition of
strength of a language bias, which refers to the size of the hypothesis
space. Given the same raw data, a language specification A results in a
smaller hypothesis space than some other language specification B if and only
if the language A performs more information compression of the raw data than
language B. It follows that the bias of A is stronger than the bias of B.
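As a rough illustration of this comparison, the sketch below counts hypotheses for two hypothetical languages over n Boolean attributes: language A admits only conjunctions in which each attribute is required true, required false, or ignored (3^n hypotheses), while language B admits any Boolean function (2^(2^n) hypotheses). The two languages, the function names, and the use of the difference in log2 sizes as a stand-in for the extra compression are all assumptions made for the example; the paper does not give a formula for the amount of compression.

```python
import math

def log2_size_unrestricted(n_attributes: int) -> float:
    """log2 of the number of Boolean functions of n attributes, log2(2**(2**n))."""
    return float(2 ** n_attributes)

def log2_size_conjunctive(n_attributes: int) -> float:
    """log2 of the number of conjunctions in which each attribute is required
    true, required false, or ignored, log2(3**n)."""
    return n_attributes * math.log2(3)

def stronger_bias(log2_size_a: float, log2_size_b: float) -> str:
    """Name the language with the smaller hypothesis space (the stronger bias)."""
    if log2_size_a < log2_size_b:
        return "A"
    if log2_size_b < log2_size_a:
        return "B"
    return "equal"

n = 4  # hypothetical number of Boolean attributes
size_a = log2_size_conjunctive(n)    # language A: conjunctions only
size_b = log2_size_unrestricted(n)   # language B: any Boolean function
# Assumed proxy for the extra compression performed by A relative to B, in bits.
extra_bits = size_b - size_a
print(stronger_bias(size_a, size_b), round(extra_bits, 2))
```

Working in log2 sizes keeps the numbers manageable, since the unrestricted space grows doubly exponentially in the number of attributes.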

Notice that the strength of a bias is relative to the original data. Suppose
that we start with two different sets of raw data, set A, and set B which is
an elaboration of set A, and we arrive at the exact same language
specification, L (i.e., L_A = L_B), for both sets. The strengths of the
language biases are not the same even though the sizes of the hypothesis
spaces are. The second specification compresses more information relative to
the original data set B.

Also notice that this definition ignores the notion of correctness, which
requires some target concept or problem definition. For example, when we
compress the raw data sets A and B to the same language (as in the last
example), the compression for B may be greater, but the compression may also
be incorrect (i.e., too strong).


2.2.2  Procedural  Bias 

To date, little attention has been directed toward the definition of the
strength of a procedural bias. We extend the definition of language bias to
define the strength of a procedural bias. Since inference procedures usually
occur over several steps, we can define the strength of a procedural bias in
terms of its rate of information compression: If a procedural bias A takes
fewer stages to perform the same amount of compression as procedural bias B,
then procedural bias A is stronger than procedural bias B. Alternatively: If
a procedural bias A compresses more information in the same number of steps
as procedural bias B, then procedural bias A is stronger than procedural
bias B.
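The two forms of this definition can be made concrete with a small sketch. It assumes, as the later discussion of entropy suggests but this section does not state, that compression is tracked as a per-stage entropy value; the two trajectories are invented numbers, not experimental results, and the function names are hypothetical.

```python
from typing import Optional, Sequence

def stages_to_reach(entropy_by_stage: Sequence[float], target: float) -> Optional[int]:
    """Index of the first stage whose entropy is at or below target, else None."""
    for stage, entropy in enumerate(entropy_by_stage):
        if entropy <= target:
            return stage
    return None

def compression_after(entropy_by_stage: Sequence[float], n_stages: int) -> float:
    """Entropy removed between stage 0 and stage n_stages."""
    return entropy_by_stage[0] - entropy_by_stage[n_stages]

# Invented entropy trajectories for two procedural biases on the same task.
bias_a = [2.0, 1.2, 0.6, 0.2, 0.1]
bias_b = [2.0, 1.7, 1.4, 1.0, 0.6]

# First form: fewer stages to reach the same amount of compression.
a_is_stronger_by_stages = stages_to_reach(bias_a, 0.6) < stages_to_reach(bias_b, 0.6)
# Second form: more compression in the same number of stages.
a_is_stronger_by_amount = compression_after(bias_a, 3) > compression_after(bias_b, 3)
print(a_is_stronger_by_stages, a_is_stronger_by_amount)
```

Neither comparison says anything about correctness; a trajectory can fall quickly while concentrating on a poor hypothesis.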

Procedural bias essentially differs from language bias in that shifting a
language bias usually implies uniform changes throughout the learning
process. For example, dropping a term from the language must be done in all
hypotheses and all logic leading to those hypotheses. Changes in procedural
bias are done in the context of efficiently finding a concept description or
a problem solution given some language bias constraint.

2.3  A  RLM’S  CREDIBILITY  SPACE 

A concept learner that receives labelled instances from a generator can, in
principle, maintain a version space (Mitchell, 1977) of hypotheses that are
consistent with all of the instances seen so far. For example, Mitchell
maintains in his Candidate Elimination Algorithm a maximally specific set and
a maximally general set of hypotheses so that not all hypotheses need to be
considered explicitly. As learning progresses, the version space becomes
smaller. Haussler points out that one of the problems with Mitchell’s
approach is that the storage space for this set of hypotheses can still
increase exponentially in size. Haussler’s analysis shows that it is not
necessary for a concept learner to maintain an explicit version space at all
(Haussler, 1987). By examining a sufficient number of instances, the learner
can develop a hypothesis that becomes ε-close to the target concept with some
large probability 1-δ.

One of the important assumptions behind Haussler’s analysis, however, is that
the learner can reject a hypothesis because all of the instances are
typically labelled as positive or negative with certainty. If the learner
develops a hypothesis that is consistent with all instances seen and it turns
out that instance information is occasionally incorrect due to mislabelling,
then the resulting classification rule may effectively over-fit the instance
data to include noise (Spears and Gordon, 1992). In a learning situation
where an instance is only labelled as to how probable an example it is of the
target concept (i.e., a probabilistic learner), maintaining a version space
of consistent hypotheses in some deterministic sense loses its meaning.

In RLMs, a version space instead consists of a single probabilistic
hypothesis that covers the space of all possible hypotheses. Equivalently, we
can consider the version space to be the set of all hypotheses, with each
hypothesis given a value indicating its degree of credibility relative to
other hypotheses. Let us call this probabilistic version space a credibility
space,^5 and for convenience, let us call the degree of certainty of a
hypothesis its credibility value. Credibility values may be decision weights,
rule strengths, or transition probabilities, depending upon the RLM.

In concept learners, a hypothesis induces a dichotomy of the instance space
(Haussler, 1988). Haussler defines the growth function as the maximum number
of dichotomies that can be induced in a hypothesis space for a finite number
of instances. He then shows that because the growth function can be related
to the number of instances required to ε-exhaust a version space, it can be
used as a measure of bias strength in a probabilistic sense (Haussler, 1988,
p. 191). The analogy to the dichotomies of instances in concept learning is
the number of inputs and outputs available to a RLM over a decision sequence.
Even though this information does give us a bound on the size of the
hypothesis space, unfortunately, it does not give us a direct bound on the
amount of testing required. Because RLMs often use non-binary evaluations, a
RLM must consider a trial hypothesis repeatedly to gain confidence in its
credibility. It is very unlikely that the RLM can reduce the credibility
space by testing a hypothesis once; however, testing does allow the RLM to
reshape the credibility values so that retesting of poor hypotheses is
minimized.

2.4  INFORMATION  COMPRESSION  AND  ENTROPY

After each stage of the search, a RLM adjusts its credibility values to
reflect the outcome of its experience. Depending on the policies employed by
the RLM, these values are increased or decreased so that future decisions
should provide better overall evaluations. The update policy generally
indicates the rate of change in the credibility values (i.e., amount of
change per stage). Depending on this rate, the credibility space of the
hypotheses becomes organized more or less quickly.

^5 Both Rendell (Rendell, 1986) and Watanabe (Watanabe, 1960) use similar
terminology. Rendell defines a credibility function of hypotheses which
assesses the credibility or belief of the various competing hypotheses.

A measure of the amount of compression per stage is the change in the
entropy. We can compute the entropy at a stage using Shannon’s entropy
measure (Shannon, 1948, p. 393)

H_{stage} = -K \sum_{i=1}^{N} p_i \log p_i   (Eqn 1)

where p_i is the credibility value of hypothesis i treated as a probability,
K is a positive constant, and N is the current size of the credibility space
at the current stage. Let us assume for convenience that K is 1.
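A minimal sketch of this stage entropy, taking Shannon’s measure in the form H = -K Σ p_i log p_i with K defaulting to 1. The function name and the normalization of raw credibility values into probabilities are assumptions added for the example; the paper only says that credibility values can be used as probabilities.

```python
import math
from typing import Sequence

def stage_entropy(credibilities: Sequence[float], k: float = 1.0) -> float:
    """Shannon entropy of a credibility space, in nats when k = 1.

    Values are normalized to sum to one (an assumption), and zero-credibility
    hypotheses contribute nothing, following the convention 0 * log 0 = 0.
    """
    total = sum(credibilities)
    if total <= 0:
        raise ValueError("credibility values must have a positive sum")
    entropy = 0.0
    for value in credibilities:
        p = value / total
        if p > 0:
            entropy -= p * math.log(p)
    return k * entropy

# Hypothetical credibility spaces over four hypotheses.
early_stage = [0.25, 0.25, 0.25, 0.25]   # nothing learned yet
later_stage = [0.85, 0.05, 0.05, 0.05]   # credibility concentrating on one hypothesis
print(stage_entropy(early_stage), stage_entropy(later_stage))
```

The drop in entropy from the first space to the second is the quantity treated here as compression between stages.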

According to Watanabe: “Inductive inference is a process such that the
distribution of weights (credibilities) becomes increasingly concentrated on
a decreasing number of cases (hypotheses) no matter how widely one
distributes the weights initially” (Watanabe, 1960, p. 210). Watanabe calls
this observed decrease in the entropy the inverse H-theorem (Watanabe, 1960,
1975).

The principle of decreasing entropy applies to concept learners. Suppose that
we start with a hypothesis space having cardinality |H| and there is an equal
probability of considering each of the hypotheses. If we can eliminate
hypotheses due to having certain instance information, and if there is an
equal probability of inspecting the remaining hypotheses, then the entropy in
the version space after eliminating all but N hypotheses is

H = -\sum_{i=1}^{N} (1/N) \log(1/N).   (Eqn 2)

This term simply reduces to log N. The quantity in (Eqn 2) can be interpreted
as the “amount of ignorance” of not knowing which of the N hypotheses is
correct (Watanabe, 561, pp. 562). When N = 1, the uncertainty reduces to
zero.
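The reduction of (Eqn 2) can be written out in one line, since all N terms of the sum are identical. This is a standard identity for the uniform distribution, not a result specific to this paper:

```latex
H = -\sum_{i=1}^{N} \frac{1}{N}\log\frac{1}{N}
  = -N \cdot \frac{1}{N}\log\frac{1}{N}
  = -\log\frac{1}{N}
  = \log N ,
\qquad\text{so } H = 0 \text{ when } N = 1 .
```

Starting from |H| equally likely hypotheses, eliminating all but N of them therefore lowers the entropy from log |H| to log N.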

In concept learners, the probabilities are not actually the same for the
hypotheses. Some hypotheses are given more weight than others depending on
the procedural biases. Because many procedural biases only implicitly
generate a distribution over hypotheses in the version space, computing
entropy for these cases becomes a challenge. In a RLM such as a CLA, it is
possible to calculate entropy since transition probabilities can be used to
compute the credibility values of trial hypotheses. These credibility values
can be used as probabilities in (Eqn 1).


3  COLLECTIVE  LEARNING  AUTOMATON  (CLA)

3.1  OVERVIEW 

The CLA is an iterative paradigm that refines its hypotheses of the solution
at each stage of the search. Within each stage, the automaton communicates
with the environment for several interactions. For each interaction, the
automaton receives a stimulus input from the environment, selects a response
output, collects the stimulus-response pair into a history, and then
transmits the output to the environment. This interaction cycle repeats until
the automaton generates a sequence of stimulus-response interactions called a
collection. At the end of a stage, the environment transmits an evaluation to
the automaton. The automaton is collective because the evaluation of the
decision sequence does not occur until a collection of stimulus-response
pairs is obtained.^6

^6 Sutton calls problems which restrict reinforcement to occur at the end of
a sequence “time blinded tasks” because the reinforcement time interval is
very often unknown (Sutton, 1984).

The CLA maintains a State Transition Matrix (STM). The STM explicitly
provides stimulus-response probabilities by partitioning the stimulus space
into a discrete number of compartments called stimulants and the response
space into a discrete number of compartments called respondents. The sum of
the probabilities across respondents for a given stimulant is one. The
automaton applies a selection function to choose a response to an
environmental stimulus based on the current contents of the STM. To modify
the STM, the automaton first develops a compensation based on an internal
transformation of the evaluation; then the automaton’s update function
changes the STM probabilities using both the compensation and the
stimulus-response information stored in the history. The cycle repeats for
several stages until some convergence criterion is met. Figure 1 summarizes
the steps using highlighted pseudocode.
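The STM and its selection function can be sketched as follows. The dictionary layout, the uniform initialization, and sampling a respondent in proportion to its probability are assumptions made for illustration; the text only requires that each stimulant’s probabilities sum to one and that selection is based on the current contents of the STM. The stimulant and respondent names are hypothetical.

```python
import random
from typing import Dict, Sequence

STM = Dict[str, Dict[str, float]]

def make_stm(stimulants: Sequence[str], respondents: Sequence[str]) -> STM:
    """Build an STM whose rows are uniform over respondents (assumed start state)."""
    p = 1.0 / len(respondents)
    return {s: {r: p for r in respondents} for s in stimulants}

def select_response(stm: STM, stimulant: str, rng: random.Random) -> str:
    """Sample a respondent for the stimulant in proportion to its STM probability."""
    row = stm[stimulant]
    names = list(row)
    weights = [row[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Hypothetical stimulants (network nodes) and respondents (moves).
stm = make_stm(["node1", "node2"], ["left", "right", "straight"])
rng = random.Random(0)
response = select_response(stm, "node1", rng)
print(response)
```

Storing one row per stimulant keeps the row-sum constraint local, so an update to one stimulant’s probabilities never disturbs another’s.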

The simplicity of the CLA makes it potentially amenable
to analysis. Because the probability of each respondent is
given for each stimulant, it is possible to compute an
estimate of the entropy at each stage. This estimate is called
the collection entropy, $H_c$. To calculate the entropy, the
automaton first computes, on the fly, the product of the
conditional probabilities of selecting responses for all
interactions except for the last. For a collection of length $l$,
the path probability, $P_{path}$, is

$P_{path} = \prod_{m=1}^{l-1} p_m$.  (Eqn 3)

The automaton then uses this probability in computing
$H_c$ at the final interaction:

$H_c = \sum_{j=1}^{r} \log (P_{path}\, p_j)$  (Eqn 4)
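As an illustration of Eqn 3, the following minimal sketch multiplies the STM probabilities of every stimulus-response pair in a collection except the last. The dictionary layout of the STM, the probability values, and the stimulant name BLUE are assumptions made only for this example; they are not taken from the experiments.

```python
from math import prod

def path_probability(stm, history):
    """Product of the STM probabilities for all interactions except the last (Eqn 3)."""
    return prod(stm[stimulus][response] for stimulus, response in history[:-1])

# Hypothetical STM: stimulant -> {respondent: probability}; each row sums to one.
stm = {
    "VIOLET": {"X": 0.5, "Y": 0.25, "Z": 0.25},
    "RED": {"X": 0.2, "Y": 0.6, "Z": 0.2},
    "BLUE": {"X": 0.1, "Y": 0.1, "Z": 0.8},
}

# A collection of length l = 3; only the first l - 1 = 2 pairs enter the product.
history = [("VIOLET", "X"), ("RED", "Y"), ("BLUE", "Z")]
p_path = path_probability(stm, history)  # 0.5 * 0.6

# The terms P_path * p_j over the respondents of the final stimulant,
# which are the quantities that appear inside the logarithm of Eqn 4.
final_terms = {r: p_path * p for r, p in stm["BLUE"].items()}
```

Because the row for the final stimulant sums to one, the terms in `final_terms` sum to the path probability itself.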

Some  other  RLMs,  such  as  neural 
networks,  implicitly  represent  the 
relationship  between  stimulants 
and  respondents  by  using  a  linear 
expression  which  maps  the 
weighted  sum  of  the  stimuli  into  a 
response.  Based  on  evaluation 
feedback,  these  learners  adjust  the 
weights  within  the  expression  to 
modify  the  associated  response. 
Still  other  RLMs  use  stimulus- 
response  rules  that  permit  condi¬ 
tion  parts  of  rules  to  intersect. 
This  intersection  corresponds  to 
having  overlapping  stimulants. 
Many  of  these  RLMs  are  interest¬ 
ing  paradigms;  CLA  has  the  vir¬ 
tue  of  having  a  simple  automaton 
underpinning  in  which  the  proba¬ 
bilities  of  input-output  associa¬ 
tions  are  explicitly  enumerated. 
The  results  of  this  study  may  be 
useful  in  examining  other  RLMs. 


COLLECTIVE LEARNING AUTOMATON                    ENVIRONMENT
BEGIN
  stage = 0
  Initialize STM
  WHILE (convergence criteria not met) DO
  BEGIN
    interaction = 0
    WHILE (not end of stage) DO
    BEGIN
      Receive stimulus                     <--   Transmit stimulus
      Select response for stimulus
        using STM
      Collect <stimulus, response>
        pair in History
      Transmit response                    -->   Receive response
      interaction = interaction + 1
    END
                                                 Update Environment
    Receive evaluation                     <--   Transmit evaluation
    Form compensation using evaluation
    Update STM using History and
      Compensation
    stage = stage + 1
  END
END

Figure 1: A Standard Collective Learning Automaton
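The inner loop of Figure 1 can be sketched in a few lines. This is only a sketch under stated assumptions: the two-node network, its travel times, the node names A, B, and GOAL, and the function names are hypothetical, and the selection function simply samples a respondent in proportion to the STM row, which is one plausible reading of the text rather than the documented implementation.

```python
import random

def select_response(stm, stimulus, rng):
    """Selection function: sample a respondent from the STM row for this stimulant."""
    respondents = list(stm[stimulus])
    weights = [stm[stimulus][r] for r in respondents]
    return rng.choices(respondents, weights=weights, k=1)[0]

def run_stage(stm, environment, start, goal, rng):
    """Run one stage and return the history plus the summary evaluation."""
    history, stimulus, elapsed = [], start, 0
    while stimulus != goal:
        response = select_response(stm, stimulus, rng)
        history.append((stimulus, response))          # collect the pair in History
        stimulus, cost = environment[stimulus][response]
        elapsed += cost                               # the environment tracks travel time
    return history, elapsed                           # evaluation arrives only at stage end

# Hypothetical environment: stimulant -> {respondent: (next stimulant, travel time)}.
environment = {
    "A": {"X": ("B", 1), "Y": ("GOAL", 5)},
    "B": {"X": ("GOAL", 1), "Y": ("GOAL", 3)},
}
# Unbiased initial STM: equal probability for each respondent.
stm = {node: {"X": 0.5, "Y": 0.5} for node in environment}

history, evaluation = run_stage(stm, environment, "A", "GOAL", random.Random(0))
```

The automaton sees no per-step cost; only the total `evaluation` is returned, which is what makes the learner collective.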
3.2  LEARNING BIASES IN THE CLA

Now let us briefly consider examples of biases within the
CLA. An example of a representational bias in the automaton
is the definition of the states in the STM (i.e., the
stimulants and the respondents). Expressing a hypothesis as a
decision sequence is an example of a grammar bias. Thus,
the language of a trial hypothesis consists of a sequence of
stimulant-respondent ordered pairs. There are several
examples of procedural bias. The formulation of the
compensation and update functions reflects inductive
procedural biases. Deductive procedural biases are reflected in
the selection function. For example, the automaton may
make each decision based only on current state information
without looking back at previous information (i.e.,
the automaton may be first-order). The selection function
governing how the automaton chooses a decision sequence
(i.e., a hypothesis) from the STM may consider all
hypotheses or only a subset of hypotheses whose probabilities
lie above a threshold.


4  EXPERIMENTAL  APPROACH 


4.1  AN  EXAMPLE  PROBLEM 

For  purposes  of  illustration,  let  us  examine  a  small  shortest 
path  problem.  Figure  2  below  summarizes  the  environ¬ 
ment’s  complete  knowledge  of  the  problem.  The  stimu¬ 
lants  that  the  automaton  can  receive  are  the  colors  within 
the  ellipses.  The  lines  shaded  in  grey  show  the  optimal 
path.  The  respondents  are  X,  Y,  and  Z;  the  time  of  travel 
associated  with  each  response  is  placed  in  parentheses. 

Notice  that  it  is  possible  for  more  than  one  respondent  to 
lead  to  the  same  next  stimulant.  For  example,  for  the  stim¬ 
ulant  called  VIOLET,  all  three  of  the  respondents,  X,  Y, 
and  Z,  lead  to  the  stimulant  called  RED.  The  time  trav¬ 
elled  depends  on  the  actual  respondent  selected.  Also 


notice  that  in  this  problem,  the  shortest  path  in  terms  of 
time  is  the  longest  route  through  the  network  in  terms  of 
the  number  of  decisions  required  (i.e.,  five  decisions). 

It is interesting to note that there is an inherent hypothesis
procedural bias in the automaton because it selects
responses on a node-by-node basis. Even when the
probabilities of selecting different responses are initially the
same at each of the nodes, the probabilities of selecting the
possible action sequences are different. There are more
paths leading to the RED stimulus if the initial response is
X (7 paths) than if the initial response is either Y (2 paths)
or Z (4 paths), so that the paths starting with an X response
are initially explored less than the paths starting with a Y
or Z response. This bias is compounded if there are
near-optimal solutions in the more easily explored parts of the
search space. In Figure 2 the response sequence Y -> Y ->
Z gives a 6 minute trip that is only one minute longer than the
optimal response sequence X -> Y -> X -> Y -> Z, which gives a 5
minute trip. So despite the small size of the graph, the
problem is not trivial. It is easy for an RLM receiving
summary evaluations to get trapped in a local optimum.
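A small worked calculation makes the node-by-node bias concrete. The sketch assumes that all three respondents are available and equally likely at every node, which matches an unbiased initial STM; the function name is hypothetical.

```python
from fractions import Fraction

def uniform_path_probability(num_decisions, num_respondents=3):
    """Probability of one specific decision sequence under a uniform STM."""
    return Fraction(1, num_respondents) ** num_decisions

short_path = uniform_path_probability(3)  # e.g. Y -> Y -> Z, the 6 minute trip
long_path = uniform_path_probability(5)   # e.g. X -> Y -> X -> Y -> Z, the 5 minute trip
ratio = short_path / long_path
```

Under this assumption the three-decision sequence has probability 1/27 and the five-decision sequence 1/243, so the near-optimal path is initially selected nine times as often as the optimal one.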

Let us now assume that the automaton does not have
respondents X, Y, and Z available to it; rather, suppose it
has an integer range of responses from 4 to 0. In other
words, the automaton has a language based on integer values
instead of letter symbols. If the automaton uses a partition
consisting of three intervals (e.g., [4, 2] [1] [0]), then
each interval represents an equivalence class in which all
responses in the interval are considered to be the same. For
example, if a respondent is the range [4, 2], then the
responses 4, 3, and 2 are equivalent. In general, there
are $2^n$ ways of partitioning the range [n, 0] into
equivalence classes.
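The count can be checked by enumeration: a partition of the range into intervals is fixed by deciding, for each of the n gaps between neighboring integers, whether to split there. The sketch below lists the partitions for n = 4; the function name is hypothetical.

```python
from itertools import product

def contiguous_partitions(n):
    """Yield every partition of n, n-1, ..., 0 into contiguous intervals."""
    values = list(range(n, -1, -1))
    # One boolean per gap between neighboring values: True means "split here".
    for cuts in product([False, True], repeat=n):
        blocks, current = [], [values[0]]
        for value, cut in zip(values[1:], cuts):
            if cut:
                blocks.append(current)
                current = [value]
            else:
                current.append(value)
        blocks.append(current)
        yield blocks

partitions = list(contiguous_partitions(4))
```

For the range [4, 0] this yields 16 partitions, from the single class [4, 0] to the five singleton classes [4] [3] [2] [1] [0].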


4.2  EXAMPLE  OF  REPRESENTATIONAL  BIAS 

In  general,  the  best  mapping  between  the  environment’s 
language  and  the  automaton’s  is  not  known.  If  we  consider 
the  automaton’s  specific  partition  of  the  range  to  be  an 
inherited  characteristic,  then  a  partition  can  be  thought  of 
as  an  intrinsic  representational  bias  in  the  automaton’s  sen¬ 
sors,  much  in  the  same  vein  that  a  bat’s  sonar-like  sensors 
vary  in  sensitivity  depending  on  the  range.  This  represen¬ 
tational  bias  determines  the  structure  of  the  automaton’s 
STM.  The  partition  given  to  the  automaton  remains  the 
same  during  its  life-span  (while  it  solves  the  shortest  path 
problem).  The  choice  of  partition  is  one  of  the  experi¬ 
mental  design  parameters. 

Figure  3  depicts  the  representation  lattice  organized  with 
the  most  general  representation  at  the  top,  where  all 
responses  over  the  range  from  4  to  0  are  placed  in  the 
same  equivalence  class  (i.e.,  {4, 3, 2, 1, 0}  =  [4, 0]),  to  the 
most  specific  one  at  the  bottom,  where  each  response  is 
considered  to  be  unique  (i.e.,  [4]  [3]  [2]  [1]  [0]).  Thus,  the 
strength  of  this  representational  bias  can  itself  be  framed 
as  a  search  space  covering  the  most  general  to  the  most 
specific  representation.  As  we  move  down  the  lattice,  each 


line represents an additional splitting of the range. When
the environment (or the problem) maps the respondents X,
Y, and Z in the graph above into one of the possible
partitions, then that partition becomes the target representation,
or more specifically, the Target Partition. For example, if
X maps to [4], Y maps to [3, 2], and Z maps to [1, 0], then
the target partition is [4] [3, 2] [1, 0]. An automaton that
happens to inherit the target partition has the strongest,
correct representational bias. Other representations,
however, may still permit the CLA to solve the shortest path
problem. Those partitions are also correct representations.
Notice that [4] [3] [2, 1] [0] is a correct partition because
there is an unambiguous mapping for each of the
environment’s symbols: X maps to [4], Y maps to [3], and Z maps
to [0]. The range [2, 1] is not useful to the automaton
because the range is ambiguous: the target partition assigns
2 to Y and 1 to Z. Combining 2 and 1 together does not
help the CLA, but the combination does not hurt the
automaton either because there are unambiguous mappings
for the symbols Y and Z. In general, the inherited
partitions which allow the automaton to converge only to
sub-optimal solutions, or not to converge at all, are
incorrect representations.
4.3  EXAMPLE OF PROCEDURAL BIAS

At the end of each stage, the environment informs the
automaton of the duration of its selected path through the
network. The automaton’s compensation function, which
is defined internally to the automaton, transforms the
environment’s score using an exponential function so that
shorter durations receive a large reward and longer
durations receive very little reward.^ Thus the initial scaling of
the reward is

(Eqn 5)

where k is some small constant.

The collection entropy is also included in the compensation
function. The collection entropy acts as a progressive
reinforcement mechanism that effectively rewards the
automaton more during early learning experiences and less
later on as information at decision points becomes more
certain. As a result, the use of the collection entropy forces
the automaton to examine more difficult-to-reach, yet
unexplored, parts of the search space, even though a
near-optimal solution may be easier to locate. Scott and
Markovitch’s DIDO system also uses an entropy measure to
decide what spaces to investigate (Scott and Markovitch,
1989). However, the CLA’s measure incorporates a
transformation of the environment’s evaluation in order to
maximize the reward of an experience.

Given a compensation factor, A, for adjusting the rate of
change of the STM probabilities, the compensation, c, is

$c = A\,(1.0 - H_c)\,f(t)$.  (Eqn 6)

The compensation factor is the procedural bias parameter
modified in the experiments. All probabilities along the
path (decision sequence) are increased in proportion to
their current values during the update of the STM. In the
experiment discussed in the results section, different levels
of the compensation factor are examined in order to study
the effect of applying different strengths of a procedural
bias. By increasing or decreasing A, the probabilities in
the STM become organized more or less quickly. These
transitional probabilities in turn affect the organization of
the credibility space of the hypotheses, where a trial
hypothesis is considered to be a particular sequence of
stimulus-response pairs. With a stronger procedural bias,
the entropy of the credibility space decreases faster.
However, too fast a rate of decrease can cause the procedural
bias to be so strong that it is incorrect. In other words, too
strong a compensation factor may lead to a sub-optimal
solution.
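The compensation and update steps can be sketched as follows. Several details here are assumptions rather than the documented method: the compensation is read as the product of the compensation factor, one minus the collection entropy, and the exponentially transformed score; the transform is taken to be exp(-k t) with a hypothetical k = 0.1; and each STM row is renormalized after the selected probability is raised so that the row still sums to one.

```python
import math

def compensation(factor, entropy, duration, k=0.1):
    """Assumed form: c = A * (1 - H_c) * f(t), with f(t) = exp(-k * t) for illustration."""
    return factor * (1.0 - entropy) * math.exp(-k * duration)

def update_stm(stm, history, c):
    """Raise each selected probability in proportion to its value, then renormalize."""
    for stimulus, response in history:
        row = stm[stimulus]
        row[response] += c * row[response]
        total = sum(row.values())
        for respondent in row:
            row[respondent] /= total  # keep the row a probability distribution

# Hypothetical single-stimulant STM with equal initial probabilities.
stm = {"RED": {"X": 1 / 3, "Y": 1 / 3, "Z": 1 / 3}}
c = compensation(factor=0.14, entropy=0.5, duration=5.0)
update_stm(stm, [("RED", "X")], c)
```

With a larger compensation factor the selected respondent gains probability faster, which is the sense in which a stronger procedural bias organizes the STM more quickly and risks locking in a sub-optimal path.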


^  This  compensation  function  is  similar  to  a  reward-inaction  policy, 
where  desirable  behaviors  are  rewarded  and  non-desirable  behaviors 
receive  no  reward  instead  of  being  penalized. 


4.4  DESIGN

In  this  study,  we  examine  the  16  partitions  discussed 
above  in  combination  with  50  levels  of  compensation  fac¬ 
tor,  ranging  from  0.01  to  0.50.  Each  combination  of  parti¬ 
tion  and  compensation  factor  level  is  repeated  20  times  in 
order  to  obtain  average  performance  values. 

Two  criteria  must  be  satisfied  to  reach  convergence:  (1) 
the  fraction  of  runs  over  the  last  200  stages  having  the 
optimal  solution  must  be  greater  than  or  equal  to  0.99,  and 
(2)  the  difference  in  the  collection  entropy  between  stages 
over  the  200  stages  is  less  than  some  small  value  epsilon 
(e.g.,  epsilon  =  0.0005). 
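The two criteria can be expressed as a single check over a sliding window. The sketch interprets criterion (1) as the fraction of the last 200 stages whose selected path was optimal, and criterion (2) as the largest stage-to-stage change in collection entropy within that window; both interpretations, and the function and parameter names, are assumptions.

```python
def has_converged(optimal_flags, entropies, window=200, min_fraction=0.99, epsilon=0.0005):
    """Return True when both convergence criteria hold over the last `window` stages."""
    if len(optimal_flags) < window or len(entropies) < window:
        return False  # not enough stages observed yet
    recent_flags = optimal_flags[-window:]
    recent_entropy = entropies[-window:]
    # Criterion (1): share of recent stages that produced the optimal solution.
    fraction_optimal = sum(recent_flags) / window
    # Criterion (2): largest change in collection entropy between consecutive stages.
    max_change = max(abs(b - a) for a, b in zip(recent_entropy, recent_entropy[1:]))
    return fraction_optimal >= min_fraction and max_change < epsilon
```

A run that has not converged by the stage limit would simply never see this function return True.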

After achieving convergence, statistics are obtained over a
window of an additional 50 stages. Two performance
measurements are taken: (1) the number of stages required to
reach convergence (i.e., the last stage of the window), and
(2) the time-average of the exponentially transformed
score f(t) (see Eqn 5). The automaton is permitted to run
up to 15,000 stages when there is no convergence.


5  RESULTS 


5.1  MEASURING  THE  CLA’S  OVERALL 
PERFORMANCE  (UTILITY) 

Figure  4  illustrates  the  change  in  the  collection  entropy 
from  stage  to  stage  for  three  runs  using  selected  levels  of 
compensation  factor:  0.05,  0.14,  and  0.50.  The  collection 
entropy  fluctuates  as  the  automaton  tests  different  trial 
hypotheses at each stage. Eventually, as the probability of
selecting  one  solution  becomes  more  certain,  the  entropy 
dramatically  decreases  and  the  fluctuations  lessen.  (If 
there  were  more  than  one  solution,  the  change  in  the  fluc¬ 
tuation  would  remain  constant).  As  the  level  of  compensa¬ 
tion  becomes  larger,  the  number  of  stages  required  to 
reduce  the  entropy  becomes  smaller. 

The  top  graph  in  Figure  5  summarizes  the  essence  of  the 
information  in  Figure  4.  Figure  5  shows  the  number  of 
stages  required  for  the  five  correct  partitions  (i.e.,  [4]  [3, 2] 
[1, 0];  [4]  [3]  [2, 1]  [0];  [4, 3]  [2]  [1,  0];  [4]  [3]  [2]  [1, 0], 
and  [4]  [3]  [2]  [1]  [0])  for  each  level  of  compensation  fac¬ 
tor  over  an  average  of  20  runs  for  each  combination.  The 
target  partition’s  line  is  dotted;  each  point  has  a  vertical 
line  indicating  plus  and  minus  one  standard  deviation 
about  the  mean.  An  analysis  of  variance  and  accompany¬ 
ing  t-tests  indicate  that  there  is  no  significant  difference 
among  the  correct  representations  most  of  the  time. 

The bottom graph in Figure 5 shows the convergence values
of the different partitions for each level of compensation
factor. These convergence values have been
normalized so that the best obtainable performance is one,
and the worst obtainable one is zero. The strongest,
incorrect partition, [4, 0], lies along the compensation factor
axis at a zero normalized convergence value. The top five
lines correspond to the correct representations; the
remaining, more spread out, lines are incorrect partitions. The
dotted line indicates the target partition; the vertical
lines indicate plus and minus one standard deviation about
each plotted point. An analysis of variance and
accompanying t-tests indicate that there are no significant
differences among the correct representations. Notice that the
overall convergence values decrease for the correct
partitions with a high level of compensation factor (e.g., 0.50).
Even though it takes fewer stages to reach convergence
(around 500), the procedural bias is so strong that some of
the runs are converging suboptimally, thus bringing down
the average convergence values.

Figure 4: Example Runs Showing Change in Collection Entropy Over Stages

Figure 5: Stages for Convergence and Convergence Values at Different Levels of Compensation Factor (top graph: Stages for Convergence, Correct Partitions Only; bottom graph: Convergence Values, All Partitions)


6  CONCLUSION 

Normally,  we  would  expect  that  in  the  top  graph  of  Figure 
5  the  correct  representations  would  vary  in  their  number  of 
stages  for  convergence,  depending  on  the  strength  of  the 
representation.  For  example,  the  target  partition,  being  the 
strongest,  correct  partition,  should  have  a  slightly  faster 
rate  of  convergence  that  is  significant  when  compared  to 
the  other  correct  partitions.  In  this  problem,  there  is  not  a 
significant  difference  between  the  representations  most  of 
the time. On the other hand, we would not expect the
convergence values in the bottom graph of Figure 5 to be
significantly different for the correct partitions, since by
definition of being correct, all of these partitions should
allow the automaton to converge to the optimal
solution.

In  the  bottom  graph  of  Figure  5,  the  incorrect  partitions, 
which  generally  converge  to  sixty  percent  or  less  of  the 
optimal  convergence  value,  also  exhibit  tremendous  vari¬ 
ance  in  their  values  (this  variance  is  not  displayed  in  the 
graph).  As  a  result,  it  is  difficult  to  discriminate  among  the 
incorrect  partitions,  as  we  might  expect  to  do,  based  on 
their normalized convergence values. Incorrect partitions
rarely permit the automaton to converge to even a
sub-optimal solution. For incorrect partitions, the convergence
values are primarily the result of averaging the last fifty
stages of 15,000-stage runs.

In  summary,  correct  partitions  could  not  be  discriminated 
on  the  basis  of  the  number  of  stages  required  for  conver¬ 
gence.  Neither  could  incorrect  partitions  be  discriminated 
based  on  their  convergence  values.  One  of  the  possible 
reasons  for  this  result  is  that  the  example  problem  uses  dif¬ 
ferent  symbols  for  the  same  paths  through  the  network 
(i.e.,  word  similes).  The  problem  does  not  show  the  effect 
of  combining  different  paths  giving  the  same  performance 
into  groups.  It  may  be  that  the  use  of  similes  does  not 
degrade  performance.  Future  research  needs  to  address 


different  senses  of  what  is  meant  by  combining  terms  into 
higher  level  ones. 

Another  problem  that  needs  to  be  explored  is  the  definition 
of  convergence  within  the  CLA.  It  may  be  that  the  current 
definition  of  convergence  inherently  yields  results  having 
high variance. A different definition of convergence may
permit the automaton to consistently converge within the same
number of stages (or close to the same) when using the
same combination of parameters.

This paper reviews the ideas of inductive learning biases
in the context of an example RLM called a CLA. For
purposes of illustration, the paper uses a simple shortest path
problem. In particular, the experimental work examines
the performance of the automaton for various combinations
of strengths in representational and procedural bias.
The  representational  bias  is  the  partitioning  of  the 
response  range  used  within  the  STM  of  the  CLA;  the  pro¬ 
cedural  bias  is  the  compensation  factor  which  determines 
the  amount  of  increase  in  the  probabilities  within  the 
STM.  The  work  also  introduces  the  use  of  entropy  as  a 
measure  of  bias  strength,  with  particular  emphasis  on  the 
strength  of  the  procedural  bias. 

The work demonstrates some of the problems of discovering
empirical measures of bias strength in an example
RLM. Other test cases need to be explored in order to
discover when stronger representations can be ascertained
empirically through performance measures.